Query and visualize our current YouTube statistics

Git doesn't support in text .ipynb hyperlinks, to use them checkout the nbviewer for this notebook

Abstract:

This notebook sql queries our video_stats table and creates a new potential sql view for later on.
All data is based on playlist of 306 videos, titled Important Videos, that went viral around 2012.
A lot of these videos are weird memes and some are actually funny but there is a range for sure

Prerequisites

  1. You will need to acquire data over a few days to run this notebook on your own machine/postgres db
  • otherwise follow along with the dataset shown here

Goals:

  1. Create SQL query of stats over time from our exisiting data: viewcount, commentcount, likecount, dislikecount
  2. View our new stats with interactive plotly charts to find neat trends
In [1]:
from datetime import datetime
import pandas as pd
import math
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
from psycopg2 import sql, connect
import psycopg2
import pandas as pd
import plotly.express as px
from IPython.display import IFrame

#import credentials (database host address) from your youtube_config.py
import sys
sys.path.insert(1, '../../')
from youtube_config import host

Lets build a few functions so we can easily watch the videos we are researching.

1. make_clickable() and make_df_url()

 - Both Make clickable HTML links for our video urls so we can take breaks from our research to see if we think the videos we're researching deserve their notoriety

2. findVideo()

  - This functions lets us search any string in the video title and returns it's most recent stats. 
  - Think of this function like a google search but just for this specific set of videos
In [2]:
def make_clickable(val):
    """
    Makes hyperlinks into clickable html
    
    val: input url
    """
    return '<a href="{}">{}</a>'.format(val,val)
In [3]:
def make_df_url(df):
    """
    Returns data frame with clickable 'pl_url' columns links 
    
    df: input data frame
    """
    return df.style.format(make_clickable, subset='pl_url')
In [4]:
def findVideo(input_str=None):
    """
    Search any string within video title. Returns most recent stats of videos with search string
    in the title as a dataframe
    
    input_str: input video title string
    """
    input_str = input_str.lower()
    returndf = sql_df[sql_df['title'].str.lower().str.contains(input_str)].sort_values(
                by='datetime', ascending=False).drop_duplicates('title').sort_values(by="title")
    
    return returndf.style.format(make_clickable, subset='pl_url')

Define database connections and create connection conn

In [5]:
db_name = "youtube_test"
user = "postgres"
host = host
password = "mypw"

try:
    # declare a new PostgreSQL connection object
    conn = connect(
        dbname = db_name,
        user = user,
        host = host,
        password = password
    )

    # print the connection if successful
    print ("psycopg2 connection:\n", str(conn).split(' ')[5:8], "host=host", str(conn).split(',')[1] )
    cur = conn.cursor()

except Exception as err:
    print ("psycopg2 connect() ERROR:", err)
    conn = None
psycopg2 connection:
 ["'user=postgres", 'password=xxx', 'dbname=youtube_test'] host=host  closed: 0>

SQL query to create new statistics from out current data

  • It probably would have been much quicker to create this dataset in Python
  • I did it in SQL so later on we can do ELT where we transform internal database data after loading.

This query can be difficult to read outside of a native sql text editor, so I outline the structure here:

Subquery to pull our newly defined stats ( CTE used to generate our first set of new stats and values to confirm we create clean data

      The CTE generates deltas based on current stats of the video minus the second most recent record
        EX: viewcount_delta = viewcount_t1 - viewcount_t0
      The deltas are made by ordering records OVER (ORDER BY title, vid_date, vid_TIME) and subtracting the
      previous record with the LAG() window function
        EX: viewcount_delta = viewcount - LAG(viewcount) OVER (ORDER BY title, vid_date, vid_TIME)
LAG previous title to check if it matches current title. Will clean data
WHERE title = last_title.
CASE WHEN record title matches previous record title
                When condition makes sure deltas are only calculated between the same column title.
                Prevents edge cases where the previous title rolls over to a new title.
THEN subtract stats to make deltas.
END

JOIN on another table that has title because our curret table does not. )
SELECT new deltas FROM CTE and create time_delta_min and view_per_min stats WHERE title = last_title ensures all deltas were calc'd between the same title. AND view_delta IS NOT NULL filters out edge cases where title != last_title

In [6]:
try: 
    cur.execute("""
--subquery to generate final view of stats
SELECT title, pl_url, position, videopublishedat, vid_date,
        viewcount, last_view, view_delta, view_per_min, 
        datetime, last_time, time_delta, time_delta_min, 
        commentcount, commentcount_delta, 
        likecount, likecount_delta,
        dislikecount, dislikecount_delta
FROM 
(
--CTE used to generate stats from viewcount, likecount, etc. The CTE is transforming the raw data into more powerful data
WITH cte AS ( 
        SELECT viewcount, likecount, dislikecount, commentcount, 
                vid_date, vid_time, title, playlists.position, pl_url, videopublishedat,
                
        --check last_title matches current title to ensure data is clean when calc'ing deltas with LAG()
        LAG(title) OVER (ORDER BY title, vid_date, vid_TIME) AS last_title,
        
        --check last_view count to ensure data is clean when calc'ing view_delta with LAG()
        LAG(viewcount) OVER (ORDER BY title, vid_date, vid_TIME) AS last_view,
        
        --check last_time date to ensure data is clean when calc'ing time_delta with LAG()
        LAG(vid_date + vid_time) OVER (ORDER BY title, vid_date, vid_TIME ) AS last_time,
        
        --create datetime column combining vid_date and vid_time cols
        (vid_date + vid_time) AS datetime,
        
        --CASE statements to calculate deltas of each statistic
        CASE 
        /*When title = LAG(title) ensures the following delta corresponds to videos with the same title
        This is needed because the records ORDER BY title, vid_date, vid_TIME
        So when the ordered records proceed to the next title, you don't care about the delta for A_title_views-B_title_views.
        */
        When title = LAG(title) OVER (ORDER BY title, vid_date, vid_time) 
        --Then calculate delta over stat 
        THEN viewcount - LAG(viewcount) OVER (ORDER BY title, vid_date, vid_time)
        END AS view_delta, 

        CASE 
        WHEN title = LAG(title) OVER (ORDER BY title, vid_date, vid_time) 
        THEN (vid_date + vid_time) - LAG(vid_date + vid_time) OVER (ORDER BY title, vid_date, vid_time)
        END AS time_delta,
        
        CASE 
        WHEN title = LAG(title) OVER (ORDER BY title, vid_date, vid_time) 
        THEN (commentcount) - LAG(commentcount) OVER (ORDER BY title, vid_date, vid_time)
        END AS commentcount_delta, 
        
        CASE 
        WHEN title = LAG(title) OVER (ORDER BY title, vid_date, vid_time) 
        THEN (likecount) - LAG(likecount) OVER (ORDER BY title, vid_date, vid_time)
        END AS likecount_delta, 
        
        CASE 
        WHEN title = LAG(title) OVER (ORDER BY title, vid_date, vid_time) 
        THEN (dislikecount) - LAG(dislikecount) OVER (ORDER BY title, vid_date, vid_time)
        END AS dislikecount_delta
        
        FROM video_stats
        INNER JOIN playlists ON playlists.resourceid=video_stats.resourceid
)
    --select * from CTE to do final transformations
    SELECT 	*, 
    --calculate time_delta_min to determine watch rate as view_per_min
    ROUND(CAST(EXTRACT(EPOCH FROM time_delta::INTERVAL)/60 AS NUMERIC), 2) AS time_delta_min, 
    ROUND(CAST(view_delta / (EXTRACT(EPOCH FROM time_delta::INTERVAL)/60) AS NUMERIC),2) as view_per_min
    FROM cte
    
    --where title = last_title verifies all deltas calc'd are a difference from the same title. Keeps data clean if anything slips through.
    WHERE title = last_title
    /*where view_delta IS NOT NULL removes all records where a delta wasnt calculated due to title != LAG(title). 
    This occurs ordered records proceed to the next title, so it results in NULL for A_title_views-B_title_views in our cte CASE statements
    */
    AND view_delta IS NOT NULL
) temptable
    """)
    
    sql_return = cur.fetchall()
    columns = [column[0] for column in cur.description]
    sql_df = pd.DataFrame(sql_return, columns = columns)

except Exception as err:
    print ("psycopg2 connect() ERROR:", err)

Transform some columns and datatypes to analyze more easily

In [7]:
sql_df['hour'] = sql_df['datetime'].dt.hour
sql_df['day'] = sql_df['datetime'].dt.day_name()
sql_df['day_num'] = sql_df['datetime'].dt.weekday
sql_df['view_per_min'] = sql_df['view_per_min'].astype(float)

Plot viewcount by position in the playlist

  • Value of view is actually defined by engagement with the ads, but businessinsider assumes 500USD/100,000 views

    • Depending on engagment and length it can range from 200USD/100,000 to 4,000USD/100,000
      • For example Gangnam Stule had 1B views and made 7.8M USD. Which is 780USD/100,000 views
  • The top video is HEYYEYAAEYAAAEYAEYAA with 171M views, ~855,000USD in revenue

  • The second most played is the start of the playlist, YEE with 78M views ~390,000USD in revenue

The links to both videos are in the df below the graph

In [8]:
top_views = sql_df[sql_df['datetime'] == sql_df.sort_values(
    by='datetime', ascending=False).loc[0,'datetime']].sort_values(
    by='viewcount', ascending=False)

fig = px.scatter(top_views, x="position", y="viewcount", color='title')
fig.update_layout(showlegend=False)
fig.show()

make_df_url(top_views[:2])
Out[8]:
title pl_url position videopublishedat vid_date viewcount last_view view_delta view_per_min datetime last_time time_delta time_delta_min commentcount commentcount_delta likecount likecount_delta dislikecount dislikecount_delta hour day day_num
26070 HEYYEYAAEYAAAEYAEYAA https://www.youtube.com/watch?v=ZZ5LpwO-An4&list=PLFsQleAWXsj_4yDeebiIADdH5FMayBiJo&index=280 280 2010-11-07 21:01:39 2020-07-31 171078784 171062516 16268 21.700000 2020-07-31 09:02:16 2020-07-30 20:32:41 0 days 12:29:35 749.58 310517.000000 34.000000 2838733.000000 548.000000 84607.000000 5.000000 9 Friday 4
70152 Yee https://www.youtube.com/watch?v=q6EoRBvdVPQ&list=PLFsQleAWXsj_4yDeebiIADdH5FMayBiJo&index=0 0 2012-02-29 19:47:08 2020-07-31 78533770 78525268 8502 11.340000 2020-07-31 09:02:16 2020-07-30 20:32:41 0 days 12:29:35 749.58 59756.000000 14.000000 870279.000000 227.000000 20184.000000 2.000000 9 Friday 4

Generate df with average views per day by video

In [9]:
avg_views_day = sql_df.groupby(by=['vid_date','title']).mean().reset_index()

Plot view_per_min by day over the past month to see if people are still watching

  • Our max, "Have you ever had a dream like this", recently averaged over 388 views per min on Aug 4
    • This averages out to 2,800USD for that day alone.
In [10]:
fig = px.line(avg_views_day, x="vid_date", y="view_per_min", color='title')

fig.update_layout(title="Views/Min of Title by Day",
    legend=dict(
        title="Legend",
        yanchor="top",
        y=-0.7,
        xanchor="center",
        x=0.5
))

fig.show()

findVideo("Have you ever")
Out[10]:
title pl_url position videopublishedat vid_date viewcount last_view view_delta view_per_min datetime last_time time_delta time_delta_min commentcount commentcount_delta likecount likecount_delta dislikecount dislikecount_delta hour day day_num
24410 Have you ever had a dream like this? https://www.youtube.com/watch?v=G7RgN9ijwE4&list=PLFsQleAWXsj_4yDeebiIADdH5FMayBiJo&index=279 279 2011-06-02 10:00:55 2020-08-22 48877167 48866417 10750 89.750000 2020-08-22 19:00:26 2020-08-22 17:00:39 0 days 01:59:47 119.78 nan nan 1043625.000000 289.000000 17426.000000 3.000000 19 Saturday 5
24647 Have you ever had a dream like this? (Metal Remix) https://www.youtube.com/watch?v=dXidW7fEH8g&list=PLFsQleAWXsj_4yDeebiIADdH5FMayBiJo&index=181 181 2013-05-12 00:49:20 2020-08-22 5677881 5677777 104 0.870000 2020-08-22 19:00:26 2020-08-22 17:00:39 0 days 01:59:47 119.78 nan nan 94985.000000 3.000000 1050.000000 -1.000000 19 Saturday 5

Plot averages views of the whole playlist by hour to see what hours people are watching most

  • Looks like more people watch in the afternoon with a peak ay 1900 CST, 7PM CST
    • also, looks like I have some dirty data in here. My script only runs on odd hours but we have entries at 1600 and 1800. We'll need to remove those from our data
In [11]:
fig = px.line(sql_df.groupby(by='hour').mean(), x=sql_df.groupby(by='hour').mean().index, 
              y=['view_per_min'])

fig.update_layout(
    title="Views/Min by Hour of Day",
    xaxis_title="hour of day",
    yaxis_title="view per min",
    legend_title="Legend",
)
fig.show()
In [12]:
sql_df[(sql_df['hour']!=16) & (sql_df['hour']!=18)].groupby(by='hour').mean().reset_index()
Out[12]:
hour position viewcount last_view view_delta view_per_min commentcount commentcount_delta likecount likecount_delta dislikecount dislikecount_delta day_num
0 1 155.0 7.344723e+06 7.344428e+06 295.115417 2.460577 6738.440628 0.485688 104858.364503 11.411003 2098.557997 0.181230 3.105263
1 3 155.0 7.345000e+06 7.344723e+06 277.110848 2.309831 6738.836196 0.395568 104868.574519 10.210015 2098.731562 0.173565 3.105263
2 5 155.0 7.345311e+06 7.345000e+06 310.841767 2.590206 6739.162881 0.326685 104878.328564 9.754045 2098.917731 0.186169 3.105263
3 7 155.0 7.345488e+06 7.345311e+06 177.683703 1.480729 6739.546076 0.383195 104887.265032 8.936467 2099.097428 0.179697 3.105263
4 9 155.0 7.343698e+06 7.343444e+06 254.221383 1.701043 6736.993860 0.510877 104814.623786 13.725890 2097.849191 0.218285 3.150000
5 11 155.0 7.345947e+06 7.345693e+06 253.914537 2.117360 6740.477378 0.511173 104912.235054 13.192131 2099.543178 0.248510 3.105263
6 13 155.0 7.346245e+06 7.345947e+06 297.925199 2.404585 6741.085134 0.607756 104926.763584 14.528530 2099.790496 0.247317 3.105263
7 15 155.0 7.348277e+06 7.347825e+06 452.728135 2.824749 6744.359825 0.688421 105005.506311 17.712460 2100.934790 0.247411 3.150000
8 17 155.0 7.343257e+06 7.342848e+06 408.876490 3.029406 6736.540557 0.527761 104798.793451 16.421283 2097.415953 0.227108 3.294118
9 19 155.0 7.350999e+06 7.350510e+06 489.695265 3.145617 6748.825678 0.646252 105098.099588 17.212857 2101.955869 0.213445 3.363636
10 21 155.0 7.345995e+06 7.345615e+06 380.250322 2.959844 6740.551930 0.520877 104906.514078 12.894013 2099.246117 0.199838 3.250000
11 23 155.0 7.344428e+06 7.344135e+06 292.830597 2.479665 6737.954940 0.387627 104846.953500 12.171691 2098.376767 0.176631 3.210526

Clean up the df and plot averages views again

  • That looks a lot cleaner and makes more sense for evening views
In [13]:
clean_df = sql_df[sql_df['hour'] % 2 > 0]

fig = px.line(clean_df.groupby(by='hour').mean().reset_index(), 
              x='hour', 
              y=['view_per_min'])

fig.update_layout(
    title="Views/Min by Hour of Day",
    xaxis_title="hour of day",
    yaxis_title="view per min",
    legend_title="Legend",
)
fig.show()

Plot averages views of the whole playlist by day to see what days people are watching most

  • Surprisingly more people are watching these videos in the middle of the week, Max = Wednesday, 2.76 view/min
    • They probably have better things to do on the weekends
In [14]:
fig = px.line(clean_df.groupby(by='day').mean().reset_index().sort_values(by='day_num'), 
              x='day', 
              y='view_per_min')
fig.show()

fig.write_html("./viz_img/scatter_viewcnt.html")
IFrame(src="./viz_img/scatter_viewcnt.html", width=1000, height=600)
In [15]:
fig = px.line(clean_df.groupby(by=['day','hour']).mean().reset_index(), 
              x='hour', 
              y=['view_per_min'], color='day')
fig.show()

Lets check our distributions with a violin plot

  • Since the max daily average is 4 we'll consider all data an order of magnitude greater outliers, so 10 > view_per_min
  • it's also interesting to see that some is negative, likely a data collection issue from youtube viewcount
In [16]:
cols = ['title', 'view_per_min', 'viewcount', 'last_view']
fig = px.violin(clean_df[clean_df['view_per_min']<10].sort_values(by='day_num'), 
                y="view_per_min", x="day", color="day",
          hover_data=sql_df[cols])
fig.show()

clean_df[clean_df['view_per_min']<0][:5]
Out[16]:
title pl_url position videopublishedat vid_date viewcount last_view view_delta view_per_min datetime ... time_delta_min commentcount commentcount_delta likecount likecount_delta dislikecount dislikecount_delta hour day day_num
306 7 Reasons Why You Should Be Listening To Mackl... https://www.youtube.com/watch?v=0nHL7IbKdDQ&li... 283 2013-08-03 01:07:51 2020-08-06 1134565 1134566 -1 -0.01 2020-08-06 09:00:45 ... 119.82 830.0 0.0 17345.0 0.0 921.0 0.0 9 Thursday 3
352 7 Reasons Why You Should Be Listening To Mackl... https://www.youtube.com/watch?v=0nHL7IbKdDQ&li... 283 2013-08-03 01:07:51 2020-08-10 1134641 1134642 -1 -0.01 2020-08-10 09:00:29 ... 119.80 830.0 0.0 17344.0 0.0 921.0 0.0 9 Monday 0
376 7 Reasons Why You Should Be Listening To Mackl... https://www.youtube.com/watch?v=0nHL7IbKdDQ&li... 283 2013-08-03 01:07:51 2020-08-12 1134668 1134669 -1 -0.01 2020-08-12 09:00:38 ... 119.78 830.0 0.0 17344.0 0.0 921.0 0.0 9 Wednesday 2
2202 andrew houston, ceo of dropbox, abuses hookers https://www.youtube.com/watch?v=oXHM9XOigUM&li... 138 2016-11-07 16:01:45 2020-08-06 323109 323111 -2 -0.02 2020-08-06 09:00:45 ... 119.82 119.0 0.0 2171.0 0.0 37.0 0.0 9 Thursday 3
2509 animal calling https://www.youtube.com/watch?v=93rrZabgzMw&li... 75 2019-11-05 19:02:36 2020-08-12 155272 155274 -2 -0.02 2020-08-12 09:00:38 ... 119.78 56.0 0.0 1489.0 0.0 5.0 0.0 9 Wednesday 2

5 rows × 22 columns